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Abstract 

Undernutrition, resulting in restricted growth, and quantified here using 
height-for-age z-scores, is an important contributor to childhood morbidity and 
mortality. Since all levels of mild, moderate and severe undernutrition are of 
clinical and public health importance, it is of interest to estimate the shape of 
the z-scores' distributions. 

We present a finite normal mixture model that uses data on 4.3 million chil- 
dren to make annual country-specific estimates of these distributions for under- 
5-year-old children in the world's 141 low- and middle-income countries between 
1985 and 2011. We incorporate both individual-level data when available, as 
well as aggregated summary statistics from studies whose individual-level data 
could not be obtained. We place a hierarchical Bayesian probit stick-breaking 
model on the mixture weights. The model allows for nonlinear changes in time, 
and it borrows strength in time, in covariates, and within and across regional 



country clusters to make estimates where data are uncertain, sparse, or missing. 

This work addresses three important problems that often arise in the fields of 
public health surveillance and global health monitoring. First, data are always 
incomplete. Second, different data sources commonly use different reporting 
metrics. Last, distributions, and especially their tails, are often of substantive 
interest. 



Keywords: complex survey data, hierarchical modeling, missing data, normal mixture 
model, stick-breaking prior 



1 Introduction 



Childhood undernutrition contributes substantially to the burden of disease in the 
low- and middle- income world (Black et al., 2008), and nutritional interventions have 
the potential to curb child mortality substantially (Gakidou et al., 2007). Nonethe- 
less, little has been known about the population-level distribution of anthropometric 
indicators in most countries and regions. 

An important indicator of nutritional status is a child's height-for-age z-score 
(haz), which is measured in standard deviations of the height-for-age distribution of 
the World Health Organization reference population (WHO, 2006). Stunting, defined 
as HAZ < —2, and especially severe stunting, defined as HAZ < —3, indicate chronic 
restriction of a child's growth and are associated with adverse physical and intellectual 
outcomes. In this paper, we develop a flexible model for the full distribution of HAZ 
scores that addresses several important challenges. 

A barrier to the rigorous population-level analysis of HAZ has been that some 
surveys report individual-level microdata, whereas others report one or more summary 
statistics describing their samples, such as average HAZ and/or the sample prevalence 
of stunting or of severe stunting. Here we present a novel Bayesian model that 
combines these disparate data sources. The model incorporates both individual- level 
data when available, as well as summary statistics from studies whose individual-level 
data could not be obtained, accounting for dependence when multiple summaries arc 
reported for a single study sample. 

In addition to being able to use data at varying levels of aggregation, we also wish 
to make coherent inference on both mean HAZ and on all levels of mild, moderate, and 
severe stunting, since the hazardous effects of undernutrition occur along a continuum. 
Exploratory analyses indicated that, across countries and ages, the observed stunting 



3 



prevalences tend to be somewhat lower than expected under normality. We concluded 
that a normal likelihood would not be universally appropriate for our dataset and, 
furthermore, that assuming normality could potentially induce a systematic bias in 
estimates of tail probabilities. Given that stunting prevalences are of substantive 
interest, this seems especially undesirable. We therefore developed a single, unified 
model that allows estimation of all outcomes of interest without assuming normality. 

There is a rich literature on flexible semi- and non-parametric Bayesian methods. 
Dunson (2010) provides a comprehensive overview, focusing on the popular class of 
Dirichlet process (DP) models. Many specifications allow for dependence across re- 
lated DPs (e.g., MacEachern, 1999; De lorio et al., 2004; Teh et al., 2006; Rodriguez 
et al., 2008). Semiparametric finite mixture models provide an accurate approxima- 
tion to the fully nonparametric class of DP models, with sufficient flexibility to flt any 
smooth density (Richardson and Green, 1997; Roeder and Wasserman, 1997). Two 
advantages of the semiparametric approach for our analysis are (a) computational 
feasibility for very large datasets and (b) avoidance of the 'fixed weights' assump- 
tion, which can impede efforts to build parsimonious models (Chung and Dunson, 
2009; Rodriguez and Dunson, 2011) and which may be inappropriate for analyses of 
unbalanced datasets (Venturini et al., 2010). 

Rodriguez et al. (2009) built from this hterature to specify a probit stick-breaking 
regression model on the weights of a finite normal mixture, keeping the basis distribu- 
tions constant across groups. An important contribution of our paper is the extension 
of their approach for analyzing microdata to include both microdata and summary 
statistics in a single model. Although a variety of Bayesian methods for meta-analysis 
(e.g., Dominici et al., 1999; Stangl and Berry, 2000; Higgins et al., 2009) are avail- 
able, to our knowledge this straightforward approach for combining disparate data 
sources has not been used previously. In addition, we extend the Rodriguez et al. 
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(2009) model to the multi-level context. Whereas Rodriguez et al. (2009) fix certain 
variance parameters a priori, we estimate all variances, allowing data-driven shrink- 
age and inducing an automatic preference for parsimony. Lastly, we specify a more 
flexible mean model, allowing for smooth non-linear change over time. 

In this paper, we focus on height-for-age as an example, producing annual es- 
timates of HAZ distributions for under- 5-year-old children in 141 low- and middle- 
income countries from 1985 to 2011. More generally, variations and trends in health 
outcomes and risks across the globe have received increased attention in recent years, 
in part driven by the UN's Millennium Development Goals, the increase in interna- 
tional funding for global health, and the demand for objective evidence about the 
comparative efficacy of interventions. Our methods are broadly applicable in this 
arena. For example, we are using this same model to analyze weight-for-age 2;-scores 
and underweight (defined as weight-for-age less than a cutoff) (Stevens et al., 2012) 
as well as hemoglobin and anemia (defined as hemoglobin less than a cutoff) (Stevens 
et al., in submission). 

In the next Section, we describe the HAZ dataset. Then, in Sec. 3, we detail the 
model's semiparametric likelihood and its hierarchical priors. Sec. 4 describes the 
MCMC algorithm, and Sec. 5 gives the results of our analysis of undernutrition. We 
cross validate the model and perform a sensitivity analysis in Sec. 6. In Sec. 7, we 
present two model extensions, and we conclude with a discussion in Sec. 8. 

2 The Data 

Our analysis of childhood height-for-age in low- and middle-income countries is based 
on a vast dataset described in Stevens et al. (2012). It includes 0.7 million individual- 
level records from health examination, nutrition, and household surveys, as well as 
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summary statistics providing information about an additional 3.6 million children, 
extracted from the WHO Global Database on Child Growth and Malnutrition and 
from reports of other national and international agencies. Despite comprehensive 
access, there are gaps in data availability, with only 126 of the world's 141 low- 
and middle-income countries contributing data and many countries having data for 
a hmited number of years in the 1985-2011 period. To facilitate estimation that fills 
in these gaps, we borrow strength geographically by grouping countries into regions, 
using the classification described in Stevens et al. (2012), and we borrow strength 
temporally using autoregressive models. This geographical sparsity is compounded by 
the fact that less than 85% of observations are from nationally-representative studies. 
Studies that are not nationally representative cannot be relied upon as unbiased 
estimates of country-level HAZ distributions, because researchers may have tended to 
select places with systematically more or less severe undernutrition. Furthermore, 
differences in HAZ distributions across communities in the same country contribute 
an additional component of variability. Yet another source of variability is due to the 
fact that only 80% of studies cover the full under-5-year-old age range. 

An important complication is that the data do not come from simple random 
samples but rather stratified cluster-based surveys developed for design-based analy- 
sis. In our applied work, we have considered two approaches to using such data in a 
model-based context. Both approaches are based on scaling the sample size in each 
survey to be equal to the effective sample size (ESS) of the survey to account for 
the loss of information due to correlation from clustered sampling (see Chen et al. 

(2011) for a similar approach). As described in the WebAppendix of Stevens et al. 

(2012) , we estimated the ESS as the sample size divided by the design effect, where 
the latter is an estimate of the ratio of the variance under the actual design to the 
variance under a simple random sample. For surveys providing only summary statis- 
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tics, we estimated the ESS by using the median design effect from surveys reporting 
individual-level data. 

Our initial strategy was to take a weighted (using the survey weights) resample 
from the observations, with the number of samples equal to the ESS; this strategy 
was used for our primary analysis (Section 3 and Stevens et al. (2012)). However, 
resampling introduces an additional component of variability, the randomness in the 
realized resample. For our analysis in Section 7.1 and in Paciorek et al. (in prep.), 
we developed a new approach of using a weighted likelihood, normalizing the survey 
weights to sum to the survey ESS. Because the surveys generally stratified by ur- 
ban/rural residence and were designed to provide valid estimates for both strata, we 
estimated ESS and scaled the weights separately for urban and rural data. As noted 
in Gelman (2007), the appropriate use of data from complex surveys in a modeling 
context is an area deserving of further research. 

3 The Model 

In this Section, we present a model that combines information from disparate and 
sparse data sources to estimate HAZ distributions for each country-year. We first ex- 
plain how we estimate HAZ distributions when individual-level data are available. We 
then describe the extension that allows us to include aggregated summary statistics, 
such as sample means and prevalences, when individual-level data are unavailable. 
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3.1 Semiparametric density estimation for dependent ran- 
dom distributions 

Let fi{z) denote the HAZ distribution in study i (i = 1, • • • , n) at time t[i] in country 
j[i] {j — 1, - ■ ■ , J). Following Rodriguez et al. (2009), we set: 

M+l 

M^) ^ ^WmiJ^{z\ Om: 0"^) (l) 

m=l 

( H^mi) Uu=l(^ - if m < M 

Wmi = < (2) 

[ n;ii(l-*(M) ifm = M + l 

^mi = S^jli] + ^mjl{\t[i] + Mmj[i]t[i] + Pm^Mm + O-mi + &mi- (3) 

Eq. (1) describes a finite mixture of M+l normal distributions, where the weights {w) 
on the constituent normal distributions vary across studies. We assign uniform priors 
to the 9m s and cr^'s (Gelman, 2006), and we constrain the cr^n's to be less than two 
times the standard deviation across all observed HAZ values. This constraint is placed 
to ensure propriety and is chosen to be noninformative. We constrain 6i < 62 < ■ ■ ■ < 
9m+i to ensure interpretability of the main effects and so that the interpretation of 
the a's doesn't change across iterations of the Markov chain Monte Carlo (mcmc) 
chains. 

We specify a probit stick-breaking model for the w's in (2). This transformation 
uses the standard normal cumulative distribution function ($) to transform a's that 
range between -00 and 00 to w's that range between and 1. Specifically, the a's de- 
termine the relative weights assigned to each cluster in the following manner: starting 
with a 'stick' of length one, ^{au) is the proportion of the stick that we break off and 
assign to Wn; ^{a2i) is the proportion of the remaining stick of length (1 — wu) that 
we break off and allocate to W2i; and so on. Larger values of ami thus correspond to 
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higher weights on the m mixture component in study i. The probit stick- breaking 
transformation allows us to place a flexible model on the a's, while ensuring that the 
w's still add to one. As an aside, note that although stick-breaking is widely used, 
it does have an interpretational drawback in analyses such as this one, in which the 
first mixture component is not necessarily expected to have the largest weight. An 
alternative is the additive log-ratio transformation (Aitchison, 1986, p. 113). 

In (3), we define ami to be comprised of six effects that allow us to leverage 
all available information in making estimates for each country-year pair. S^j is a 
country-by-component interaction term, determining the baseline weight placed on 
each of the M+1 normal distributions in country j. (p^j is a country- and component- 
specific linear time effect, determining the linear parts of country j's time trend. The 
hierarchical priors for the 6'^ and (p'^ terms are described in Sec. 3.2. Letting T = 27 be 
the total number of time-points of interest (1985, 1986, • • • , 2011), the T- vector u^j 
captures smooth nonlinear change over time in country j and mixture component 
m, with details provided in Sec. 3.3. l3m contains the effects for mixture component 
m of time- varying country-level covariates, x, (maternal education, national income, 
urbanization, and an aggregate metric of access to basic healthcare, as described in 
Stevens et al., 2012). The a's are study-specific random effects, and the 6's capture 
the extra variance of studies that do not fully cover the under- 5-year-old age range. 
These error terms and their variances are described in Sec. 3.4. 

3.2 Hierarchical intercepts and linear time slopes 

The model has a hierarchical structure: studies are nested in countries, which are 
nested in regions (indexed by k), which are, of course, all nested in the globe. For each 
mixture component, we specify hierarchical priors for the country-specific random 
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intercepts {S'^) and the country-specific linear time slopes {(fi'^), with country-specific 
effects centered around their region-level counterparts {6^' and (p'^) and region-level 
effects centered on the global means {S^ and (p^). For both the intercepts and the 
time slopes, we use a fat-tailed ^4 prior at the country level to allow for the fact that 
some countries may deviate substantially from their expected values given covariates 
and geographic hierarchy: 



The r's determine the degree of intercept {t^) and slope (r*^) shrinkage performed 
at the country- (r'^'' and r'^''), and region-levels {r^"^ and r'^'^). Whereas Rodriguez 
et al. (2009) set each r parameter equal to 1, we assign a uniform hyperprior to 
each of the r's (Gelman, 2006), truncating such that r > .00001 (as smaller values 
are not identifiable from one another). An advantage of our approach is that the 
data are able to inform estimation of the r's. This allows for different degrees of 
shrinkage depending on the signal-to-noise ratio in the data. We borrow strength 
across units, compromising between the overly-noisy unit-specific estimate and the 
overly-simphfied all- unit estimate (Gelman and Hill, 2007). 

Importantly, our specification favors a simplified model. In those cases when the 
data do not contain evidence to the contrary, the natural Bayesian penalty on model 
complexity (Jefferys and Berger, 1992) causes extraneous effects to be shrunk to zero, 
allowing the model to collapse down to a simplified specification. This is an especially 
desirable characteristic given that our model contains such a large number of random 
effects. 
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3.3 Nonlinear change in time 

HAZ distributions are expected to change smoothly over time. In country j and 
mixture component m, we capture smooth nonhnear change in time using the T- 
vector Umj. We model u^j hierarchically by defining it as the sum of country, region, 
and global components: Umj — + u^^^j-^ + u^. In order to allow the model to 
differentiate among the degrees of nonlinearity that exist at the country, region, and 
global levels, the T- vectors u^^, u^^^^, and are each given a second-order Gaussian 
autoregressive prior (Breslow and Clayton, 1993; Rue and Held, 2005) with mean 
zero and precision matrices A^P, AJ^^P, and A^P respectively The scaled precision 
matrix, P, penalizes first and second differences. 

For each of the precision parameters, we use a truncated flat prior on the standard 
deviation scale (l/y/X) as recommended by Gelman (2006). We truncate these priors 
such that logA < 12 for each A. This upper bound is enforced as a computational 
convenience: models with logA > 12 are considered to be equivalent to a model with 
logA = 12 as they have essentially no extra-linear variabihty (a.k.a. 'wigghness') in 
time. Furthermore, we order the A's a priori: A^ < AJ!„ < A^ for m = 1, • • • , M. This 
prior constraint conveys the natural expectation that, for example, the global HAZ 
trend has less extra-linear variability than the trend of any given region. 

The matrix P has rank T — 2, corresponding to a flat, improper prior on the mean 
and time slope of the rt's (Wood, 2006, p. 191). Thus we have a proper prior in a 
reduced-dimensional space, P{u\X) oc A— exp (-|w'Pu) (Rue and Held, 2005). We 
constrain the mean and slope of each u-vector to be zero to avoid non-identifiability 
with the S and (p terms. 
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3.4 Error terms 

We include random effects for nationally-representative studies to capture the fact 
that even after accounting for sampling variability, a nationally-representative study 
may still not reflect its country's HAZ distribution with complete accuracy, due to 
unobserved study design and measurement issues (see, for example, the 2002 study 
from China in Fig. 1). The study-specific random effect, a^j, allows mixture com- 
ponent m in study i to have an unusually high or an unusually low mixture weight 
after accounting for the other terms in the model. We assign independent t priors to 
the a's to account for the fact that some unusual studies may deviate substantially 
from their expected country-year mean: ami ~ ^4(0,Vmi)- The scale parameter, v^i, 
depends on whether the study is national or subnational: 



Random effects from nationally-representative studies are constrained to have less 
variability than random effects from other studies {v^ < ior m — 1, ■ ■ ■ ,M). 
Thus the random effects for non-nationally-representative studies account for both 
the non-sampling errors mentioned above and for within-country heterogeneity. 

For studies that do not fully cover the under-5-year-old age range, we include 
an additional error term for each mixture component to capture variability of HAZ 
distributions across ages; b^i ~ ^4(0, v^^- For studies that do cover the full age range, 
we set bmi — ior m — 1, ■ ■ ■ , M. 

Note that to make country-level predictions, we set a = 6 = 0, thus removing 
the random effects due to imperfections in study design and to within-country and 
across-age variability of haz distributions. 




f ^ if study i is nationally representative 



v: 



m 



otherwise. 



12 



3.5 The joint likelihood of a sample's mean and tail proba- 
bilities 

Recall that a motivating goal of this analysis is to specify a model that incorporates 
both individual-level and aggregated data. In this Section, we extend the model 
presented above to incorporate summary statistics, accounting for dependence when 
multiple summaries are reported for a single sample. 

In the case of our example analysis, this enables us to include data from 349 
studies for which we were unable to obtain microdata. The sample sizes are large, 
ranging from 200 to 643,200 children, with a median sample size of 2,000. For each 
study, we have one or more of the sample mean HAZ, sample prevalence of stunting 
(haz < —2), and sample prevalence of severe stunting (haz < —3). Since each of 
these statistics can be written as the sum of a large number of independent random 
variables, we invoke the multivariate central limit theorem, giving us a multivariate 
joint normal likelihood for the summary statistics of each sample. Importantly, this 
specification accounts for dependence across multiple summaries reported for a single 
study sample, e.g., a sample's mean and one or more of its sample prevalences. We 
derive the mean and covariance of the normal distribution in the Appendix, using the 
ESS in place of the actual sample size to account for the complex survey designs. 

We conducted a simulation study (results not shown) showing that the aggregated- 
data sample sizes are large enough for multivariate normality to be a reasonable 
approximation of the true sampling distribution of the summary statistics. This 
approximation is least accurate in those studies that have both a small sample size 
and a low stunting prevalence. 

To obtain the full hkehhood, allowing coherent posterior inference that incorpo- 
rates data reported at both levels of aggregation, we simply multiply the microdata 
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likelihood with the aggregated-data likehhood. 

4 Computation 

We fit the model via MCMC, programming the sampler using the statistical computing 
language R (R Development Core Team, 2008). Algorithms to fit mixture models often 
introduce latent variables that break the mixture and thus enable Gibbs sampling 
(Rodriguez et al., 2009). However, since our dataset includes summary statistics, 
we would have had to impute the individual-level values underlying each summary 
statistic in order to allow Gibbs sampling. Given the large sample sizes and the fact 
that the imputed samples would have had to satisfy data-imposed constraints on their 
means and tail probabilities, this option was not feasible. 

As an alternative, we considered a Metropolis-Hastings algorithm that breaks the 
mixture for the microdata and then proposes model parameters from their closed-form 
full-conditional distributions given the microdata only. In such a high-dimensional 
model though, good mixing relies heavily on an algorithm that makes moves with 
a covariance roughly proportional to the full posterior covariance. Since these pro- 
posal covariances did not take the aggregated data into account, mixing under this 
algorithm was very poor. 

In the end, we therefore opted for a Metropolis-within-Gibbs algorithm. We used 
an adaptive Metropolis algorithm due to Shaby and Wells (2011) to update the a's, 
and we used Gibbs sampling for the parameters in the hierarchy. Note that a key 
strategy to achieve better mixing was to jointly sample 6 and cr, the means and 
variances of the normal mixture components, because of the strong dependence across 
these parameters. Also, for each m, wc jointly sampled each of {AJ^, {'itr«j}i=i, ■■•,>/}; 
{A^, {u'[^f^k=i,...,K}-i and {A^, tt^} because of their strong cross-level dependence. We 
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did not marginalize over mean parameters in the model, despite having some normal 
conditional distributions, because this would have introduced off-diagonal structure 
into the likelihood covariance, requiring manipulation of large covariance matrices to 
calculate the marginal densities. 

We started five chains in parallel at randomly-selected starting values. We phased 
blocks of parameters into the MCMC sequentially, allowing each block to adjust to the 
data before constraining it by starting the next block. In particular, we started the 
a's first, holding all other parameters constant at their starting values. This allowed 
the a's to reach reasonable 'pseudo-data' values before we began updating the ^'s, 
cr's, and the country-level effects. 

After a cumulative burn-in of 22,000 iterations, we ran each chain for 30,000 more 
iterations. We then combined the five chains and thinned by a factor of 30 to obtain 
chains of length 5,000 with which to summarize the posterior. Due to the lack of 
identifiability inherent in mixture likelihoods, the 9 and a parameters mixed least 
well - we obtained only 23-208 effectively-independent samples of these parameters. 
The imperfections in mixing of the mixture component parameters did not impact 
mixing at the level of inference however. The country-year HAZ means, for example, 
had an average of 3962 effectively-independent samples. 

5 Results 

We find that across age groups and countries, HAZ distributions are unimodal and 
roughly symmetric (see Fig. 1 for a few examples). As a result, (M-|-l)=5 clusters 
provide sufficient flexibility to fit these data. We present a sensitivity analysis of our 
results to the value of M in Sec. 6. 

In the first row of Fig. 1, we demonstrate how the semiparametric weighted- 
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Figure 1: Three examples of country- level results. The estimated HAZ density in 

the year 2000 is shown in the first row as a solid black line, with pointwise 95% 

uncertainty bands in grey. The raw data is estimated with a kernel density estimator 

(black dashed hne) for comparison. The last three rows show trend estimates (black) 

and their uncertainties (grey) for three summary statistics of interest. Data points are 
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indicated by symbols, with 95% uncertainty intervals reflecting sampling variability. 



mixture model fits three example HAZ densities. Note that in Colombia and Gabon, 
where data are richer (with n=3,221 and n=2,251 individual-level year-2000 observa- 
tions, respectively), the modeled densities follow the data-based empirical densities 
more closely than in China, where data are relatively sparse (n=227 individual-level 
year-2000 observations). In Colombia, where the density is tighter, fewer component 
normal distributions are given non-negligible weight relative to China and Gabon, 
where the densities are more diffuse. 

Similar lessons can be drawn by examining the model's fit to the summary statis- 
tics shown in the last three rows of Fig. 1. We achieve tighter fits in data-rich China 
(total n==lll,471 across all data sources) and Colombia (total n=23,680) than in 
relatively data-poor Gabon (total n— 2,251). Despite Gabon's data sparsity, we are 
nonetheless able to make reasonably certain inference, with estimates informed by co- 
variates, by the rich data available from other countries in Sub-Saharan Africa and, to 
a lesser extent, by data from other regions as well. The uncertainty of these estimates 
reflects the degree of country-to-country and region-to-region variability observed in 
other parts of the dataset. 

A motivating goal of this analysis is to estimate population-level distributions of 
HAZ, not just at the the country level, but also for regions and for the globe as a 
whole. In Fig. 2, we show region- and global-level results. These estimates are calcu- 
lated using weighted averages of each region's constituent country-level distributions, 
weighting by the size of the country's population. All 141 of the world's low- and 
middle-income countries contribute to these calculations, including the 15 countries 
that have no data. This allows us to reflect our uncertainty due to data sparsity, in 
contrast to models that deterministically set risk factor levels in countries without 
data to some pre-specified value (e.g.. Mason et al., 2005). 

Our substantive findings on undernutrition in children are discussed in full in 
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Figure 2: Population-weighted region- and global-level trend estimates (solid colors) 
and their pointwise 95% uncertainty bands (transparent colors). 



Stevens et al. (2012). In short, we find that although anthropometric status in devel- 
oping countries has improved on average between 1985 and 2011, worldwide progress 
has been uneven. The largest absolute improvements in HAZ means occurred in 
Asia and the largest relative reductions in prevalence in Southern and Tropical Latin 
America. Anthropometric status worsened slightly in sub-Saharan Africa from the 
mid 1980s to the late 1990s, when it began improving. In 2011, 314 million (95% 
uncertainty interval 296-331 million) under-five children were mildly, moderately or 
severely stunted. Most of these children lived in South Asia and sub-Saharan Africa. 



6 Model checking 

Out-of-sample performance is of particular relevance in this analysis because (a) such 
a complex and highly-structured model is at risk for overfitting, and (b) one of the 
primary purposes of our modeling efforts is to make predictions and report uncer- 
tainties for country-years without data. For these reasons, we conducted an extensive 
cross-validation of our model as well as two other candidate models. 
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Our first alternative candidate model was actually a trio of models that made 
separate estimates for means and for each of the two prevalences without sharing 
information across them in the form of a distribution. Microdata were summarized 
and combined with grouped data. Means and prevalences (using a probit regression 
formulation) were each modeled separately using a hierarchical Bayesian model built 
upon (3). These separate models for each reporting metric had the advantage of 
mixing quickly and being easy to describe to global health researchers. Due to concern 
about the predictive power of our covariates, our second alternative candidate model 
was the same as the primary model but excluded all covariates. 

Our five-fold cross validation consisted of two tests. In Test 1, in each of the five 
folds, we held out all data from non-overlapping subsets of 10% of countries with data 
(i.e., created the appearance of countries with no data where we actually had data), 
with held-out data from a mix of countries that were data rich (> 6 years of data 
with at least one year of data after 1999), data poor (< 3 years of data), and that had 
average data density (4-5 years or > 6 years of data with no data after 1999). We fit 
the model to the data from the remaining 90% of countries and made predictions of 
the held-out observations. We calculated the absolute and relative differences between 
our predictions and the held-out data. 

The hold-out algorithm for Test 2 was designed to examine how well the model 
predicted mean and prevalence < -3 when only prevalence < -2 was available, as 
commonly occurred. (See, for example, the 1989 study from Colombia, in Fig. 1.) 
For countries with data on mean as well as prevalences below -2 and -3, we maintained 
prevalence < -2 and held out one or both of the other two metrics, such that a total 
of 10% of all datapoints were excluded. We fit the model to the remaining 90% of 
the dataset and made predictions of the held-out observations. 

As shown in Table 1, our main model (Ml) predicted the known-but- masked truths 
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well - the magnitudes of the errors are very similar to the median difference between 
two national surveys in the same country-year; on average they are smaller than the 
median difference between two surveys in the same country-year when all national 
and subnational surveys are considered. Furthermore, our model made better out- 
of-sample predictions than the model without covariates (M3) when we held out all 
studies from a given country (Test 1). By contrast, our model did not make better 
out-of-sample predictions than the trio of separate models (M2) when we held out all 
studies from a given country (Test 1). We hypothesize that the predictive accuracy 
for a new country for a given metric of interest is robust to the differences between Ml 
and M2 because these predictions rely on covariates and on the region-level estimates 
for the metric, which are likely driven primarily by data on the metric of interest 
rather than by data on other metrics. 

Importantly, the main model had lower error than the trio of separate models in 
predicting held-out mean and prevalence < -3 (the more-commonly missing indica- 
tors) when prevalence below -2 was observed (Test 2). This advantage arises because, 
by estimating the whole distribution, the mixture model leverages the information 
contained in one metric to make estimates for another metric that was not observed. 

In both tests, we also examined the validity of the main model's 95% uncertainty 
intervals, checking whether 95% of held-out values were included in the 95% uncer- 
tainty intervals. Overall, the 95% uncertainty intervals of our estimates included 
92-95% of held-out study HAZ means and prevalences in Test 1 and 95-98% in Test 
2, consistent with the expected 95%. While these results are reassuring, we note that 
the cross-validation can only assess our quantification of predictive uncertainty in re- 
lation to the observed data. The presence of additional variability (beyond sampling 
variability) related to shortcomings in study quality in the nationally-representative 
studies makes it difficult to assess our quantification of uncertainty in the true country- 
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Median absolute error 


Median relative error 


n 


Ml 


M2 (p) 


M3 (p) 


Ml 


M2 (p) 


M3 (p) 



Test 1: Holding out all studies from 10% of countries 



mean HAZ 


181 


0.21 


0.20 


0.16 


0.28 


0.00 


0.15 


0.14 


0.56 


0.19 


0.00 


%HAZ<-2 


292 


6.02 


5.94 


0.36 


8.05 


0.00 


0.17 


0.17 


0.19 


0.25 


0.00 


%HAZ<-3 


202 


3.62 


3.40 


0.57 


4.12 


0.00 


0.27 


0.26 


0.26 


0.30 


0.00 


Test 2: Holding out mean 


and %<-3 when %<-2 


is known 






mean HAZ 


83 


0.10 


0.16 


0.00 


0.10 


0.72 


0.10 


0.15 


0.00 


0.10 


0.89 


%HAZ<-3 


111 


1.76 


1.86 


0.00 


1.76 


0.19 


0.16 


0.19 


0.01 


0.16 


0.30 



Table 1: Errors in predictions of held-out data for three candidate models. Ml is 
the main model; M2 is a trio of separate models for mean and for each of the two 
prevalences; M3 excludes all covariates. n gives the number of held-out observations. 
Prediction errors were compared across models using the non-parametric Wilcoxon 
signed-rank test for paired data (as discussed in Dietterich, 1998), with p- values in 
the (p) columns for the null hypotheses that the median differences between the errors 
of M2 and M3 with Ml are zero. The tests were conducted assuming independence of 
the held-out values. The p- values should therefore be interpreted as an approximation 
because there is some dependence among the held-out observations. 
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level trends. 

In addition to the cross validation, we also conducted a sensitivity analysis to 
assess the extent to which our estimates changed in a model that included two addi- 
tional mixture components. Estimates from the 7-component model were very similar 
to those of the 5-component model, with a median absolute difference of mean HAZ 
equal to 0.03 and median absolute difference of stunting and severe stunting preva- 
lences both equal to 0.63 percentage points. Of course more rigorous methods exist for 
choosing a fixed value for M (e.g., Kass and Raftery, 1995; Spiegelhalter et al., 2002), 
including estimation of M as a model parameter (Richardson and Green, 1997). 

7 Model extensions 

7.1 'Mixture of mixtures': Including data from exclusively 
urban and exclusively rural populations 

The dataset described in Stevens et al. (2012) excluded studies that were from selected 
subgroups that might have had lower or higher nutritional status than the general 
population. In particular, studies of exclusively urban or exclusively rural populations 
were omitted from the analysis. In this Section, we present an extension of the main 
model that allows the inclusion of these urban- and rural-only datapoints and that 
allows separate inference for each subgroup. 

We let s index 'strata', with s — u corresponding to an urban stratum and s — r 
corresponding to a rural stratum. We let h index observations: In a grouped-data 
study that combines urban and rural individuals, there would be a single observation; 
in a grouped-data study that reports results separately for urban vs. rural individuals 
there would be two observations. As before, i indexes studies, j indexes countries. 
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and k indexes regions. The likelihood for data from stratum s from study i is: 

M+1 

fsi{y) = XI ^"^^^ ^{y\^m, 0-^) ■ 

m=l 

Consider a grouped-data observation h that includes both urban and rural data. 
Let Pu be the proportion of urban dwellers in that country-year, and let Pr be the 
proportion of rural dwellers, such that Pu + Pr — 1- The resulting density for ob- 
servation h is a mixture of two (M-l-l)-component mixture densities. The crux of 
this 'mixture of mixtures' extension is that this density can be written as a simple 
(M-l-l)-component mixture density as in the original model: 

M+1 M+1 

fh{y) = p« X 

'^mri\h] •A/'(y|^m,(^^) 

m=l m=l 
M+1 

= {PuWjnuiih] +PrWmri[h\) {yl^m, (^m) ■ 

m=l 

This trivial manipulation allows us to include data from selected subgroups with- 
out altering the fundamental structure of our model. Furthermore, it allows us to 
make inference separately by stratum and to use the population weights to combine 
inferences across strata and report country-level estimates as before. 

We model the urban/rural effect flexibly by adding three new terms to the ex- 
pression for a. 7*^ is a country- and component-specific stratum effect, and p"^ allows 
this effect to vary linearly with time (we assume no nonlinear effect for simplicity). 
Cjni is a study- i-specific error in the urban/rural effect. This results in the following 
expanded expression for a: 

where Igi is a centered stratum indicator for study i: 




1 if stratum s contains urban individuals in study i 
— 1 if stratum s contains rural individuals in study i. 
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The 7's and p's are modeled hierarchically, with the same structure as described 
in Sec. 3.2 for the 5's and 



The c's are modeled analogously to the 6's, with c^j ~ t4{0, v^). 

An important advantage of the centered stratum indicator parameterization is 
that (5^'s can capture country-to-country differences in HAZ densities while the 7^'s 
capture country-to-country differences in the magnitude of the difference between 
urban and rural populations. In a scenario where there's a single, global effect of 
the urban- rural difference, 7^, then and can both go to zero. A non-centered 
formulation would have included an embedded assumption that HAZ distributions of 
one stratum are more variable than those of the other stratum. 

Based on MCMC samples from the posterior of the expanded model described 
above. Fig. 3 shows trends in the urban-rural difference in mean HAZ and in the 
prevalence of stunting by region and for the globe as a whole. First, we note that 
there is an urban advantage in all regions, with Southern and Tropical Latin America 
showing less of a gap than other regions. In most regions, there is evidence of a 
narrowing of the urban-rural gap over time, with the strongest evidence and largest 
decreases in the gap in Southern and Tropical Latin America and (since the 1990s) 
East and Southeast Asia. Interestingly, the global advantage in growth for urban 
children is stronger than the regional advantages, reflecting the fact that regions that 
are more heavily urbanized tend to have higher HAZ, such that the global urban- rural 
difference reflects both within-region differences in growth and cross-region differences 
in urbanization and growth. 

This 'mixture of mixtures' approach generalizes to allow inclusion of other selected 
subgroups. For example, this method would allow us to treat age in a more nuanced 




Pmk ~ ^ {pL V 



m ) 1 
m ) 
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mean(HAZ) P(HAZ<-2) P(HAZ<-3) 




1985 1990 1995 2000 2005 2010 1985 1990 1995 2000 2005 2010 1985 1990 1995 2000 2005 2010 

Figure 3: Population- weighted region- and global-level urban-rural difference esti- 
mates with pointwise 95% uncertainty bands. Oceania is omitted to reduce clutter 
and because of large posterior uncertainty. 



fashion, rather than simply categorizing studies according to whether they covered 
the full under-5-year-old age range. Instead, we could make age-specific estimates 
for each age stratum, and combine across strata via 'mixture of mixtures' for studies 
that report HAZ in broad age bands and for the purposes of making country-level 
inference. 

7.2 'Sloshing': Including main-effect terms in the expression 
for OL 

Note that each term in the expression for ami (3) is mixture-component specific. In 
particular, covariate effects (/3m) and country-level intercepts (5^^) ^^"^ linear time 
trends (v^mj) are each estimated separately by mixture component. In this Section, 
we discuss the addition of 'main effects' (/3°, b'f ^ and ^'f ^ respectively) that apply 
across all mixture components, as in Rodriguez et al. (2009). Adding these three 
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t = 1994 t = 2000 t = 2002 t = 2004 t = 2010 



ft=Ew,mXfm 




T — I — I — I — I — I — r T — I — I — I — I — I — r T — I — I — I — I — I — r n — i — i — i — i — i — r n — i — i — i — i — i — r 

-3 -2 -1 1 2 3 -3 -2 -1 1 2 3 -3 -2 -1 1 2 3 -3 -2 -1 1 2 3 -3 -2 -1 1 2 3 

HAZ 



Figure 4: The HAZ distribution at a given time (denoted here as ft and depicted in 
black), is a weighted mixture of M + 1 (here three) normal distributions (shown in 
red, green, and blue). The mixture weights, w, vary in time, while the locations and 
scales of the component normals remain constant. Here, the effect of time is linear 
on the probit stick-breaking scale of the a's. This results in a 'sloshing' of the HAZ 
distributions, with - for positive ip, as depicted here - mean HAZ (shown in grey) 
decreasing in time. 

effects to the expression for ami yields 

ami = + 6^j^] + ifft[i] + (fmj\i]t[^] + UmMt\i] + (3^'xi + f3'^Xi + a^i + hmi- 

We note that - though (3^ and ip'^ capture linear effects of covariates and time on the 
a's - their effects are not linear on the scale of the w's. For example, in Fig. 4, we 
show that the linear main effect of time (v^^^) results in a 'sloshing'-type movement 
in the HAZ distributions. For even more flexibility, one could also introduce nonlinear 
main effects for the u terms. 

We propose extending the Rodriguez et al. (2009) model to the hierarchical setting 
by specifying priors for the main effects 5'^^ and ip^ that are analogous to those 
described in Sec. 3.2 for the component-specific terms, 5'^ and ip^: 
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We place flat, improper priors on 5^°, </7^°, and /3°. We constrain the mean of each 
set of component-specific 5's, 93's, and /3's to be zero to avoid non-identifiabihty 
with their corresponding main effects. For example, for each country j, we constrain 
XI m ^mj = to avoid non-identifiabihty with 5f. 

An advantage of this specification is that the main effects parsimoniously capture 
variation along the 'sloshing' axis, whereas the interaction terms provide the added 
fiexibility needed to describe any residual variability across shapes of distributions. 
In cases when this flexibility is not needed, i.e., when sloshing suffices to describe 
all variation across shapes of distributions, the model can zero out the interaction 
terms. In our analyses to-date we have not included these main effect terms because 
we were concerned that with sparse data, the interaction terms might be over-shrunk. 
In particular, given substantive interest in the tails, we wanted to ensure that data 
on the tails (and not data on the mean) was most influential in making inference 
about the tails. However, we feel that this speciflcation of main effects as well as the 
sensitivity of our modeling results to inclusion of the main effects are worth additional 
methodological investigation. 

8 Discussion 

Efforts to improve global health will depend on monitoring of health outcomes, and 
much of the monitoring will be based on incomplete data from disparate sources. 
Substantive interest will often focus on the distributions of health indicators, and 
especially on their tails. 

We have specifled a semiparametric model for estimating population-level distri- 
butions of these indicators. This flexible structure allows us to estimate chnically- 
important tail prevalences without imposing parametric assumptions. It also allows 
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us to combine data sources that are aggregated in different ways and reported us- 
ing a variety of metrics, accounting for redundancies across summary statistics tliat 
describe the same sample. We borrow strength in time, covariates, and within and 
across regional country clusters to make inference where data are sparse or missing. 
The method naturally extends via our 'mixture of mixtures' approach to handle addi- 
tional stratification, including data that are reported separately by stratum and data 
that are summarized across strata. 

A methodological innovation is the extension of the Rodriguez et al. (2009) model 
to the fully hierarchical context. Though they refer to their model as 'hierarchical', it 
does not have multiple nested levels, and it does not allow borrowing of strength in a 
data-driven manner, since the variances of all groups of effects are set to one a priori. 
We generalize their model to the multi-level context and estimate all variance compo- 
nents. An important consequence is that our model automatically favors parsimony, 
an especially desirable characteristic given that the model space is so large. 

A potential weakness of our model is our reliance on asymptotics in the speci- 
fication of a joint normal likelihood for the aggregated-data summary statistics. In 
particular, for those studies with small sample sizes and low prevalences of stunting, 
the assumed normal likelihood provides only a rough approximation to the statistics' 
true sampling distribution. We note, however, that it is oftentimes the case that the 
parametric family of a likelihood or prior is only an approximation. 

A complication that we have ignored in this analysis is measurement error in chil- 
dren's heights, which may be especially pronounced for very young babies. This likely 
leads us to overestimate the variability of the true distributions and to overestimate 
the tail prevalences. 

While our confidence in the model is buoyed by the cross-validation results that 
indicate that our inference reflects the important sources of variability, there are 
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nonetheless a number of potential model improvements that are beyond the scope of 
this analysis. These include consideration of additional covariates, non-linear covari- 
ate effects, and covariate interactions, including covariate effects that vary by region 
and time. In addition, while data sparsity led us to assume that a number of model 
parameters were constant across region, it would be worthwhile to investigate allow- 
ing the country-level variance components, including the autoregressive smoothing 
parameters, to vary by region. 

Finally, we note that in addition to analysis of height-for-age and weight-for-age 
(Stevens et al., 2012), a third standard metric is weight-for-height z-score (WHz), 
an important measure of short-term nutrition, including famine. Modeling outcomes 
such as WHZ that vary rapidly in time is difficult because real changes are likely 
poorly-identified relative to study design and measurement issues that can lead to 
large differences between surveys conducted in the same year and between one year 
and the next, even for large nationally-representative surveys. 
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Appendix: Moments of the joint normal likelihood 



of a sample's mean, variance, and tail probabilities 

Note: for clarity, the notation in this Appendix differs slightly from that of the main 
body of the paper. 

Wc take yi,--- ,yn to be a sample of n observations from a mixture density: 
fiy) — Yl^=i'^cJ^iy\^c,o''^)- For ease of notation, let the constituent normal distribu- 
tions yc ~ M{9c-, u"^) have probability density function and cumulative distribution 
function F^. We begin by determining the means and variances of each of the two 
types of summary statistics. We then turn our attention to the covariances across the 
statistics. 

Sample Means: Define = ^ yi- A property of mixture distributions is that 
y's mean is just a weighted sum of its components' means: 



Let the variance of y be denoted cr^. Then 

2 def X j-,r 2 



- V{y)^E[y']-e^ 



where the last equality is a property of normal mixtures. 



By the central limit theorem (clt), we have that T„ 



approx 



Sample Prevalences: Define Tp{x) — ^ Yll^=\^{yi ^ Then — Ylic=i'^cFc{x), 
and by the CLT: 



Qov{Tp{xi),Tp{x2))- Consider two cutoffs, Xi < X2- 

1 



Px{^ -Px) 



n 



Cov{Tp{xi),Tp{x2)) 



l{yi < xi} I{yj < X2} 



,i=i 



PxiPx2 



n 



i{y < xij^Hvi < ^2} 



Px\Px2 



n 



1 



E\l{yi < xi}l{yi < X2} + 
{n - l)l{yi < xi}I{yj < X2}] - Px^P. 



= -{Pxi + {n- l)pxiPx2) - PxiP. 



X2 



= -Pxi(i-Px2)- 



n 
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Calculation of the last covariance requires a bit of additional notation. Let dc 
denote the mean of the truncated J\f{y\9c, < x} distribution: 

"cx "c 7-1 / \ ■ 

Fc{x) 

Let 9x denote the mean of the truncated mixture distribution, ^fyl{y < x}: 

1 



Px 



yfy{y)dy 



X J —oo 

C 



1 . ^ 

= — J^^c / yfc{y)dy 



1 ^ 

— ^wcFc{x)ec 

c=l 



CoY{Tm,Tp{x)): 

Cov{Tm,Tp{x)) 



n 

v? 
1 



n n 



OPx 



E[y,Y,^{y^<^} 



-Op, 



E [yil{yi <x} + {n- l)yil{yj < x}] - Op, 



= - (£" [yi\yi <x]px + {n- l)9px) - Op, 



n 



^ (OxPx + {n- l)Opx^ - Opx 
x-O). 



n 

Pxfn 



This gives us the full joint likelihood for the aggregated data: 



( T \ 

-*- m. 



T 

■J- 1 



Xi 



approx ^ 



/ \ 

\ PX2 ) 



( 



1 

n 



\ 



Pxi{Oxi-0) Pxi{l-Pxi) 
\ Px2{0x2 - 0) -PX2) PX2(1 -Px^) ) 
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